
Dbt setup #4011


Merged: 63 commits merged into main, Feb 22, 2025

Conversation

zschira
Member

@zschira zschira commented Jan 13, 2025

Overview

This PR sets up a dbt project within the PUDL repo that will be used for data testing. Details on setup and usage can be found in the README.md. This PR also includes several data validations that have been converted to dbt tests. The tests converted so far are all of the vcerare asset checks and the FERC fuel-by-plant cost per MMBtu range checks.

In scope

  • Standardize organization and naming conventions
  • Develop some basic macros that will enable migrating existing tests
    • Weighted quantile
    • Review validate.py for any other obvious candidates
  • Auto-generate row count checks
  • Update the README with directions for migration (how to add a new asset, which types of tests are in scope for immediate migration)

Out of scope

Closes #3997, #3971

@zschira zschira requested a review from zaneselvans January 13, 2025 19:30
@zschira zschira marked this pull request as draft January 13, 2025 19:31
@zschira zschira requested a review from e-belfer January 13, 2025 19:44
Comment on lines 10 to 14
tables:
  - name: out_eia923__boiler_fuel
  - name: out_eia923__monthly_boiler_fuel
  - name: out_ferc1__yearly_steam_plants_fuel_by_plant_sched402
  - name: out_vcerare__hourly_available_capacity_factor
Member

In a data warehouse with hundreds of tables, would this file be created and managed by hand? Or would there be some rule-based way to generate it, or parts of it, along the lines of what we're doing with the Pandera schema checks right now? For example, the not_null tests here are a second place where that restriction is specified -- it's already present in our table metadata, which seems like a recipe for them getting out of sync.

Or in the case of row counts, is there a clean, non-manual way to update the row counts to reflect whatever the currently observed counts are? Especially if we're trying to regenerate expected row counts for each individual year, filling it all in manually could be pretty tedious and error-prone. We've moved toward specifying per-year row counts on the newer assets so that they work transparently in either the fast or full ETL cases, and the asset checks don't need to be aware of which kind of job they're being run in, which seems both more specific and more robust.

Member

Looks like the "X column is not null" checks are currently defined in fields.py under the field constraints -- is that what you're thinking about?

I think it would be nice to have auto-generated tests like the non-null tests & row counts defined alongside manually added tests. Then all the tests will be defined in one place, except for the tests that we need to write custom Python code for.

That seems pretty doable - YAML is easy to work with, and dbt lets us tag tests, so we could easily tag all the auto-generated tests so our generation scripts know to replace them but leave the manually-added tests alone.
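To make that concrete, here is a rough sketch of how tagging could separate the two kinds of tests. The source name, column, tag value, and the regex test are illustrative assumptions, and the tests/data_tests key name depends on the dbt version in use:

sources:
  - name: pudl  # source name assumed for illustration
    tables:
      - name: out_vcerare__hourly_available_capacity_factor
        columns:
          - name: county_id_fips
            data_tests:
              # auto-generated from PUDL field constraints; tagged so a
              # regeneration script knows it can safely replace it
              - not_null:
                  config:
                    tags: ["auto_generated"]
              # manually written test; regeneration leaves it alone
              - dbt_expectations.expect_column_values_to_match_regex:
                  regex: "^[0-9]{5}$"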

Member

In addition to the field-specific constraints, I think we automatically add NOT NULL check constraints to the PK fields when we construct the SQLite database -- but more generally, I'm just saying that we need to get all of these generated tests integrated non-duplicatively into the dbt tests somehow.

Member Author

It seems totally possible to auto-generate tests, but there are probably many ways to accomplish this, so we should figure out what we want from it. For example, when we talk about auto-generating row count/not null tests, will these be generated once and committed into the repo, or will some/all of them be dynamically generated at runtime?

It definitely seems tricky to minimize duplication between dbt and our existing Python schema info. I also wonder how this plays into any refactoring of our metadata system?

Member

It feels like we may need to clearly define the data tests that are ready to be migrated in a straightforward way, and the things that still need design work, so we can point Margay folks at the stuff that's ready to go and keep thinking about the things that still need some scaffolding?

Member

If we do end up needing to define these intermediate tables it seems like we would want to have some kind of clear naming convention for them?

Member Author

Yeah I think that seems like a good idea. Maybe just use a validation_ prefix and otherwise follow existing naming conventions?

Comment on lines 21 to 22
- dbt_expectations.expect_compound_columns_to_be_unique:
    column_list: ["county_id_fips", "datetime_utc"]
Member

Could be generated based on the PK that's defined for every table?

Member Author

Should be possible. We can also probably come up with a way to generate foreign key checks, so we could actually verify foreign keys for tables that only exist in Parquet.
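As a rough sketch (not what this PR implements), a generated table entry could pair a compound-uniqueness test built from the PK with dbt's built-in relationships test for each FK; the referenced source and table names below are placeholders:

- name: out_vcerare__hourly_available_capacity_factor
  data_tests:
    # primary key uniqueness, generated from PUDL table metadata
    - dbt_expectations.expect_compound_columns_to_be_unique:
        column_list: ["county_id_fips", "datetime_utc"]
  columns:
    - name: county_id_fips
      data_tests:
        # foreign key check: every value must exist in the referenced table
        - relationships:
            to: source('pudl', 'placeholder_county_table')
            field: county_id_fips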

Comment on lines 16 to 20
- dbt_expectations.expect_table_row_count_to_equal:
    value: |
      {%- if target.name == "etl-fast" -%} 27287400
      {%- else -%} 136437000
      {%- endif -%}
Member

Is there a clean way to specify the expected row counts for each year of data (or some other meaningful subset) within a table, as we've started doing for the newer assets in Dagster asset checks, so we don't have to differentiate between fast and full validations, and can identify where the changes are?

Member Author

We'd probably need to create a custom macro for this, but that seems totally doable. Big question is how we want to generate/store all of those tests.

Member

The row count tests have functionally become regression tests -- we want to know when they change, and verify that the magnitude and nature of the change is expected based on the code or data that we've changed. Given that there are hundreds of tables (and thousands of table-years) it doesn't seem practical to hand-code all of the expected row counts.

It would be nice to have the per table-year row counts stored in (say) YAML somewhere, and be able to generate a new version of that file based on current ETL outputs. Then we could look at the diffs between the old and the new versions of the file when trying to assess changes in the lengths of the outputs.
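For example (purely illustrative -- the file name, layout, and all numbers below are placeholders), the stored expectations could be a YAML mapping from table name to per-year counts that a regeneration script rewrites from current ETL outputs and a custom generic test reads at run time:

# dbt/etl_full_row_counts.yml (hypothetical path)
out_vcerare__hourly_available_capacity_factor:
  2019: 27287400
  2020: 27287400
out_eia923__boiler_fuel:
  2023: 1234567

Because the file is committed, a change in output length shows up as an ordinary diff in review rather than a pass/fail surprise at test time.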

Comment on lines 60 to 62
- dbt_expectations.expect_column_quantile_values_to_be_between:
    quantile: 0.05
    min_value: 1.5
Member

I'm guessing these are not using the weighted quantiles?

Member Author

Yeah, these are just basic quantiles. It's not too hard to write a SQL query that does a version of weighted quantiles, but the existing vs_historical tests are harder because they compute a bunch of quantiles and then compare them all.
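For reference, the quantity a weighted-quantile macro would need to reproduce is the standard weighted empirical quantile (general definition, not taken from this PR):

$$Q_w(p) = \min\Bigl\{\, x_k : \frac{\sum_{i :\, x_i \le x_k} w_i}{\sum_i w_i} \ge p \Bigr\}$$

where the $x_i$ are the column values, the $w_i$ their weights, and $p$ the desired quantile; with all $w_i = 1$ this reduces to the ordinary quantile used in the check above.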

dbt/README.md Outdated
### Adding new tables
To add a new table to the project, you must add it as a
[dbt source](https://docs.getdbt.com/docs/build/sources). The standard way to do
this is to create a new file `models/{data_source}/{table_name}.yml`. If the the
Member

@zschira I am confused by the difference between this instruction, which tells users to add the table schema as an individual yml file named after the table, and the instructions below for adding tests, which mention a schema.yml file that would presumably contain all models.
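For context, a minimal per-table source file of the kind the README excerpt describes might look roughly like this (the source name and directory are assumptions based on the excerpt above):

# models/vcerare/out_vcerare__hourly_available_capacity_factor.yml
version: 2
sources:
  - name: pudl
    tables:
      - name: out_vcerare__hourly_available_capacity_factor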

Member

@zaneselvans zaneselvans left a comment

See my revised comments from my first review. The only blocking issues are documentation.

@zschira
Member Author

zschira commented Feb 21, 2025

@zaneselvans I've added some updated documentation, let me know if there's anything else needed before merge!

I'll also continue to work through your individual comments in the PR and we can collect next steps as issues/tasks in this epic I've created: #4075.

@zaneselvans zaneselvans self-requested a review February 21, 2025 18:45
Member

@zaneselvans zaneselvans left a comment

Maybe I'm confused but it seems like there's an error in the definition of ALL_TABLES?

Plus a few documentation clarifications. I tried to add suggestions where I thought I knew what it meant, but I'm not sure.

"--partition-column",
default="inferred",
type=str,
help="Column used to generate row count per partition test. If 'inferred' the script will attempt to infer a reasonable partitioning column.",
Member

Gotta spell out what the default inferred behavior is here, and also how data is partitioned if you choose a non-default / non-inferred column.

Comment on lines +338 to +346
def add_tables(**kwargs):
    """Generate dbt yaml to add PUDL table(s) as dbt source(s).

    The ``tables`` argument can either be a list of table names, a single table name,
    or 'all'. If 'all' the script will generate configuration for all PUDL tables.

    Note: if ``--clobber`` is set, any manually added configuration for tables
    will be overwritten.
    """
Member

Having read through this a little more it seems like this really is only for the initial table configuration, since if you try to use it to update the table schema information later, it'll wipe out any other manually configured data tests. Is that right?

If so, then what's the guardrail that ensures the dbt schemas and the PUDL schemas are always in sync? If we mess up the manual update or forget to do it entirely, will the dbt tests fail because the schemas don't match the data they're associated with?

@zaneselvans zaneselvans self-requested a review February 21, 2025 22:34
@zaneselvans zaneselvans added the dbt, data-validation, and testing labels Feb 22, 2025
@zaneselvans zaneselvans added this pull request to the merge queue Feb 22, 2025
Merged via the queue into main with commit 17e35a2 Feb 22, 2025
17 checks passed
@zaneselvans zaneselvans deleted the dbt_setup branch February 22, 2025 06:06
Labels
data-validation: Issues related to checking whether data meets our quality expectations.
dbt: Issues related to the data build tool aka dbt.
testing: Writing tests, creating test data, automating testing, etc.
Projects: Archived in project
Development

Successfully merging this pull request may close these issues.

  • Port 2-3 logically complicated validations to dbt
  • Data validation project design conversation
4 participants